60 research outputs found

    Speaker Diarization Based on Intensity Channel Contribution

    The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of localization information to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC), based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We demonstrate that combining the ICC and TDOA features improves the robustness of the localization features and reduces the diarization error rate (DER) of the complete system (using localization and spectral features). With this new localization feature, we achieve a 5.2% relative DER improvement on our development data, a 3.6% relative DER improvement on the RT07 evaluation data and a 7.9% relative DER improvement on the RT09 evaluation data.
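
    The ICC idea is simple enough to sketch. The following is a minimal, illustrative computation of per-channel relative energy; the array layout, framing and epsilon are assumptions for the example, not the paper's implementation:

        import numpy as np

        def intensity_channel_contribution(frames):
            """ICC-style feature: each channel's energy relative to the total
            energy over all channels, computed frame by frame.

            frames: array of shape (n_channels, n_frames, frame_len) holding
            the windowed samples of each microphone channel (assumed layout).
            """
            # Per-channel, per-frame energy.
            energy = np.sum(frames ** 2, axis=-1)                 # (n_channels, n_frames)
            # Total energy over all channels, floored to avoid division by zero.
            total = np.sum(energy, axis=0, keepdims=True) + 1e-12
            # Relative contribution of each channel to the summed energy.
            return energy / total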

    BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization

    We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there is no one-size-fits-all adaptation strategy: convincing synthesis requires a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results across multiple corpora indicate that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
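
    As a rough illustration of the adaptation loop described above, the sketch below runs Gaussian-process Bayesian optimization (via scikit-optimize's gp_minimize) over a few fine-tuning hyper-parameters. The parameter names, ranges and the fine_tune_and_score routine are placeholders for the example, not BOFFIN TTS's actual configuration:

        from skopt import gp_minimize
        from skopt.space import Real, Integer

        # Hypothetical fine-tuning hyper-parameters and ranges (assumptions).
        search_space = [
            Real(1e-5, 1e-3, prior="log-uniform", name="learning_rate"),
            Integer(500, 5000, name="fine_tune_steps"),
            Integer(8, 64, name="batch_size"),
        ]

        def fine_tune_and_score(learning_rate, fine_tune_steps, batch_size):
            # Placeholder: fine-tune the pre-trained TTS model on the target
            # speaker's utterances with these hyper-parameters and return a loss
            # to minimize, e.g. 1 - speaker_similarity on held-out target audio.
            raise NotImplementedError("plug in the actual fine-tuning routine")

        def objective(params):
            learning_rate, fine_tune_steps, batch_size = params
            return fine_tune_and_score(learning_rate, fine_tune_steps, batch_size)

        # Bayesian optimization with a Gaussian-process surrogate over the space.
        result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
        print("best hyper-parameters:", result.x)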

    SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

    Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results on downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim of learning shared representations of phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when 20% of the phonemes are replaced at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when the audio is mixed with 75% Gaussian noise. We also provide empirical evidence that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and leveraging rich pre-trained phonetic embeddings in speech generation tasks. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields. Comment: In proceedings of the 26th European Conference on Artificial Intelligence (ECAI 2023). 8 pages + 1 appendix page.
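
    For intuition, a CLIP-style objective over paired acoustic and phonetic embeddings can be written as a symmetric InfoNCE loss. The PyTorch sketch below assumes batch-aligned embeddings and a fixed temperature; these are illustrative choices, not the exact SCRAPS setup:

        import torch
        import torch.nn.functional as F

        def clip_style_contrastive_loss(acoustic_emb, phonetic_emb, temperature=0.07):
            """Symmetric CLIP-style InfoNCE loss between paired acoustic and
            phonetic embeddings of shape (batch, dim)."""
            # L2-normalize so the dot product is a cosine similarity.
            acoustic = F.normalize(acoustic_emb, dim=-1)
            phonetic = F.normalize(phonetic_emb, dim=-1)
            # Pairwise similarities: entry (i, j) compares audio i with phonemes j.
            logits = acoustic @ phonetic.t() / temperature
            # Matching pairs lie on the diagonal.
            targets = torch.arange(logits.size(0), device=logits.device)
            # Symmetric cross-entropy: audio-to-phoneme and phoneme-to-audio.
            loss_a = F.cross_entropy(logits, targets)
            loss_p = F.cross_entropy(logits.t(), targets)
            return 0.5 * (loss_a + loss_p)

        # Example with random embeddings standing in for encoder outputs.
        audio = torch.randn(16, 512)
        phones = torch.randn(16, 512)
        print(clip_style_contrastive_loss(audio, phones))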